statfandomcom-20200213-history
Isotonic regression
Problem Given values a_1 , ..., a_n and their positive weights w_1 , ..., w_n , approximate them by y_1 , ... y_n as close as possible, subject to a set of constraints of kind y_i \ge y_j : : Given :: values \mathbf{a} \in \Bbb{R}^n , :: weights \mathbf{w} \in \Bbb{R}^n such that w_i>0 for all i, :: set of constraints E \subset \{1,...,n\}^2 , : minimize \sum_i w_i \left( y_i-a_i \right)^2 with respect to \mathbf{y} \in \Bbb{R}^n subject to y_i \ge y_j for all (i,j)\in E If all weights equal to 1, the problem is called ''unweighted isotonic regression'', otherwise it is called ''weighted isotonic regression''. Example Minimize 3(y_1-2)^2+(y_2-1)^2+(y_3-4)^2+2y_4^2 subject to : y_1\le y_2 : y_2\le y_3 : y_2\le y_4 Linear ordering Of a particular interest is ''linear ordering isotonic regression'': : Given :: values \mathbf{a} \in \Bbb{R}^n , :: weights \mathbf{w} \in \Bbb{R}^n such that w_i>0 for all i, : minimize \sum_i w_i \left( y_i-a_i \right)^2 with respect to \mathbf{y} \in \Bbb{R}^n subject to y_1 \le y_2\le\;...\;\le y_n For linear ordering isotonic regression, there is a simple linear algorithm, called [[#Pool Adjacent Violators Algorithm|Pool Adjacent Violators Algorithm]] (PAVA). If all weights are equal 1, the problem is called ''unweighted linear ordering isotonic regression''. Example * Isotonic regression for 10^6 points. * Left: points (x_i, a_i) , where a_i=0 \text{ or }1 . The probability that a_i=1 is determined by logistic function. Only 1000 points shown. * Right: outcome of isotonic regression (black graph) versus logistic function (red graph). The logistic function had been restored with high precision. Other variants Non-euclidean metrics Sometimes other metrics are used instead of the Euclidean metric, for instance L_1 metrics: : \sum_i w_i |y_i-a_i|\! or unweighted L_\infty metrics: : \max_i |y_i-a_i| Points at a grid Sometimes, each value a_i corresponds to a point x_i at 2d or higher-dimensional grid. The fit value must increase at each dimension: : y_i \le y_j \text{ if } x_{ik} \le x_{jk} \text{ for all } k Algorithms Pool Adjacent Violators Algorithm Pool Adjacent Violators Algorithm (PAVA) is a linear time (and linear memory) algorithm for [[#Linear ordering|linear ordering isotonic regression]]. Preliminary considerations The algorithm is based on the following theorem: '''Theorem''': For an optimal solution, if a_i \ge a_{i+1} , then y_i = y_{i+1} '''Proof''': suppose the opposite, i. e. y_i < y_{i+1} . Then for sufficiently small \varepsilon , we can set : y_i^\mathrm{new} = y_i + w_{i+1} \varepsilon : y_{i+1}^\mathrm{new} = y_{i+1} - w_i \varepsilon which decreases the sum \sum_i w_i(y_i-a_i)^2\! without violating the constraints. Therefore, our original solution was not optimal. Contradiction. Since y_i = y_{i+1} , we can combine both points (w_i, a_i) and (w_{i+1}, a_{i+1}) to a new point \left({w_i a_i + w_{i+1} a_{i+1} \over w_i + w_{i+1}}, w_i + w_{i+1}\right) . However, after we combine two points (w_i, a_i) and (w_{i+1},a_{i+1}) to the new point \left(w_i^\prime, a_i^\prime\right) , this new point can violate the constraint a_{i-1}\le a_i^\prime . In this case it should be combined with the (i-1)-th point. If the combined point violates its constraint, it should be combined with the previous point, etc. The algorithm '''Input''': * values a_1,...,a_n * weights w_1,...,w_n '''Output''': * non-decreasing values y_1,...,y_n minimizing \sum_i w_i (y_i-a_i)^2 \! '''Algorithm''' : a'_1 \leftarrow a_1 : w'_1 \leftarrow w_1 : j \leftarrow 1 : S_0 \leftarrow 0 : S_1 \leftarrow 1 :for i=2,3,...,n do :: j \leftarrow j+1 :: a'_j \leftarrow a_i ::while j>1 and a'_j < a'_{j-1} do ::: a'_j \leftarrow {w'_j a'_j + w'_{j-1} a'_{j-1} \over w'_j + w'_{j-1}} ::: w'_j \leftarrow w'_j + w'_{j-1} ::: j \leftarrow j-1 :: S_j \leftarrow i :for k=1,2,...,j do ::for l=S_{j-1}+1,...,S_j do ::: y_l=a'_k Here S defines to which old points each new point corresponds. Arbitrary case algorithms In the arbitrary case, this can be solved as a quadratic problem. The best algorithm takes \Theta(n^4) time, see: * [http://www.jstor.org/discover/10.2307/170640 Maxwell, WL and Muckstadt, JA (1985), "Establishing consistent and realistic reorder intervals in production-distribution systems", ''Operations Research'' 33, pp. 1316-1341. ] * [http://www.springerlink.com/content/q37356540773k6x7/ Spouge J, Wan H, and Wilbur WJ (2003), "Least squares isotonic regression in two dimensions", J. Optimization Theory and Apps. 117, pp. 585-605. ] Implementations R isoreg The function '''isoreg''' performs unweighted linear ordering isotonic regression. It does not require any packages. For many simple cases, it is enough. Example of usage: x=sort(rnorm(10000)) y=x+rnorm(10000) y.iso=isoreg(y)$yf plot(x,y,cex=0.2); lines(x,y.iso,lwd=2,col=2) Iso The package '''Iso''' contains three functions: * pava - linear ordering isotonic regression, weighted or unweighted. * biviso - 2-d isotonic regression * ufit - unimodal order (increasing then decreasing) Example of usage: install.packages("Iso") # should be done only once library("Iso") # should be done once per session x=sort(rnorm(10000)) y=x+rnorm(10000) y.iso=pava(y) plot(x,y,cex=0.2); lines(x,y.iso,lwd=2,col=2) isotone '''isotone''' is the most advanced package. * gpava - linear ordering isotonic regression, weighted or unweighted, for any metrics * activeSet - general isotonic regression for any metrics Example of usage: install.packages("isotone") # should be done only once library("isotone") # should be done once per session x=sort(rnorm(10000)) y=x+rnorm(10000) y.iso=gpava(y)$x plot(x,y,cex=0.2); lines(x,y.iso,lwd=2,col=2) Comparison of speed Since all three libraries somehow implement the PAVA algorithm, we can compare their speed. As we can see, for unweighted linear order isotonic regression (LOIR) with n>200 , '''isoreg''' should be used. For weighted LOIR and unweighted LOIR with n \le 200 , '''pava''' should be used. '''gpava''' should never be used. Also, the implementations of weighted simple ordering isotonic regression on R are far from perfect. Java [http://www.cs.waikato.ac.nz/~ml/weka/ Weka], a free software collection of machine learning algorithms for data mining tasks written in the University of Waikato, contains, among others, an [http://weka.sourceforge.net/doc.stable/weka/classifiers/functions/IsotonicRegression.html isotonic regression classifier]. The classifier is deeply engrained into the Weka's hierarchy of classes and cannot be used without Weka. Usage Calibration of output for categorical probabilities ''For more information, see "[http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf Predicting Good Probabilities With Supervised Learning]", by Alexandru Niculescu-Mizil and Rich Caruana. '' Most of the supervised learning algorithms, for example Random Forests [http://escholarship.org/uc/item/35x3v9t4#page-1], boosted trees, SVM etc. are good in predicting the most probable category, ''but not the probability''. Some of them tend to overestimate high probabilities and underestimate low probabilities. (One notable exception is neural networks, which themselves produce a well calibrated prediction.) In order to obtain the correct probability, unweighted linear order isotonic regression can be used: : \hat y^\mathrm{final} = f(\hat y) where ** \hat y^\mathrm{final} is a final prediction for probability ** \hat y is a prediction given by the model ** f is a non-decreasing function In order to find f, model's predictions on the validation set are matched with output variable, and the isotonic regression is applied to the pairs. Alternative approach is to pass \hat p via a sigmoid function: : \hat y^\mathrm{final} = {1\over 1 + e^{a-b\hat y}} This is called Platt calibration. At small dataset sizes, Platt calibration is better than isotonic regression. Starting at 200-5000 observations, the isotonic regression surpasses Platt calibration slightly. Note also that this kind of isotonic regression is easier and faster than the logistic regression needed by Platt calibration. Using for recommendation engines The problem We have a model M which predicts residuals r of another model, trained on the training set. For cases with small number of users or items, the model is overfitted, and predictions need relaxation: : \hat r_{ui}^\mathrm{final} = f(\hat S_{ui})r_{ui} where ** \hat r_{ui}^\mathrm{final} is a final prediction for user u, item i ** \hat r_{ui} is a prediction given by the model ** f is a non-decreasing function ** S_{ui} may be either S_u (number of observations with user u in the training set), or S_i (number of observations with item i in the training set), or \min(S_u, S_i) , depending on the nature of the model. The letter S stands for Support. Solution Given set of triplets (\hat r_k, r_k, S_k) from validation database, we have: : \min_f \sum_k \left(r_k - \hat r_k f(S_k)\right)^2 = \min_f \sum_k \hat r_k^2 \left( f(S_k) - \frac{r_k}{\hat r_k} \right)^2 + \operatorname{const} Therefore, for k-th observation we should set weight w_k=\hat y_k^2 and value y_k = r_k / \hat r_k . The triplets (S_k, w_k, y_k) should be sorted with respect to S, then the isotonic regression should be applied to them. If a function directly predicts the rating or acceptance rate, rather than predicting the residual of another model, we can define r_{ui} as the difference between this model and some simpler model (i. e. average rating for the item plus average rating correction for the user). External links * [http://web.eecs.umich.edu/~qstout/IsoRegAlg.html Isotonic Regression Algorithms], by Quentin F. Stout - review of the best current existing algorithms Articles * Predicting Good Probabilities With Supervised Learning, by Alexandru Niculescu-Mizil, Rich Coruana * Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, by John C. Platt Comments